Multiple Agent Event Detection and Representation in Videos

Authors

Asaad Hakeem

Mubarak Shah

Abstract

We propose a novel method to detect events involving multiple agents in a video and to learn their structure in terms of a temporally related chain of sub-events. The proposed method makes three significant contributions over existing frameworks. First, we present the concept of a video event graph to learn the event structure from training videos. The video event graph is composed of temporally correlated sub-events and is used to automatically encode the event correlation graph. The event correlation graph signifies the frequency of occurrence of conditionally dependent sub-events. Second, we pose the problem of event detection in novel videos as clustering the maximally correlated sub-events, and use normalized cuts to determine these clusters. The principal assumption made in this work is that events are composed of highly correlated chains of sub-events that have high weights (association) within a cluster and relatively low weights (disassociation) between clusters. Last, we recognize the importance of representing the variations (in the temporal order of sub-events) occurring in an event and encode the probabilities directly into our representation. We show results of our learning and detection of events for videos in the meeting, surveillance, and railroad monitoring domains.

Introduction

The world that we live in is a complex network of agents and their interactions, which we term events. These interactions can be visualized in the form of a hierarchy of events and sub-events. An instance of an event is a composition of directly measurable low-level actions (which we term sub-events) having a temporal order. For example, a voting event is composed of a sequence of move, raise, and lower hand sub-events. Also, the agents can act independently (e.g. voting) as well as collectively (e.g. a touchdown in a football game) to perform certain events.
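The event correlation graph mentioned in the abstract can be illustrated with the voting example. The sketch below is only an assumption about how such weights might be estimated: it counts how often one sub-event directly follows another across training sequences and normalizes by the number of times the first sub-event has any successor. The function name `correlation_weights` and the three training sequences are hypothetical, and the paper's own estimator is not given in this excerpt.

```python
from collections import Counter

def correlation_weights(training_sequences):
    """Estimate w[a][b] = count(a directly followed by b) / count(a with any successor).

    This first-order adjacency estimate is an illustrative assumption,
    not the estimator used in the paper.
    """
    pair_counts = Counter()
    parent_counts = Counter()
    for seq in training_sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            parent_counts[a] += 1
    weights = {}
    for (a, b), n in pair_counts.items():
        weights.setdefault(a, {})[b] = n / parent_counts[a]
    return weights

# Hypothetical training data: three executions of a voting event,
# one of them with a different temporal order of sub-events.
training = [
    ["move", "raise", "lower"],
    ["move", "raise", "lower"],
    ["raise", "move", "lower"],
]
weights = correlation_weights(training)
for parent, children in sorted(weights.items()):
    for child, w in sorted(children.items()):
        print(f"{parent} -> {child}: {w:.2f}")
```

Because the weights leaving each sub-event sum to one, variations in temporal order show up as probability mass split across several outgoing edges rather than as a separate model.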
Hence, in the enterprise of machine vision, the ability to detect and learn the observed events must be one of the ultimate goals. In the literature, a variety of approaches have been proposed for the detection of events in video sequences. Most of these approaches can be arranged into three categories based on their approach to event detection. First, approaches where event models are pre-defined include force dynamics (Siskind 2000), stochastic context free grammars (Bobick and Ivanov 1998), state machines (Koller, Heinze, and Nagel 1991), and PNF Networks (Pinhanez and Bobick 1998). These approaches either manually encode the event models or provide constraints (grammar or rules) to detect events in novel videos. Second, approaches that learn the event models, such as Hidden Markov Models (HMMs) (Ivanov and Bobick 2000, Brand and Kettnaker 2000), Coupled HMMs (Oliver, Rosario, and Pentland 1999), and Dynamic Bayesian Networks (Friedman, Murphy, and Russell 1998), have been widely used in the area of activity recognition. These learning methods either model single-person activities or require prior knowledge about the number of people involved in the events. Moreover, variation in the data may require complete re-training to modify the model structure and parameters to accommodate those variations. Similarly, there is no straightforward method of expanding the domain to other events once training has been completed. Third, approaches that do not model the events but utilize clustering methods for event detection include co-embedding prototypes (Zhong, Shi, and Visontai 2004) and spatio-temporal derivatives (Zelnik-Manor and Irani 2001). These methods find event segments by spectral graph partitioning (e.g. normalized cut) of the weight (similarity) matrix.

Copyright © 2005, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
These clustering methods assume a maximum event length and are restricted to single-person, non-interactive event detection. What is missing in these approaches is the ability to model long, complex events involving multiple agents performing multiple actions simultaneously. Can these approaches be used to automatically learn events involving an unknown number of agents? Will the learnt event model still hold for a novel video in the case of interfering events from an independent agent? Can these approaches extend their abstract event model to representations related to human understanding of events? Can a human communicate his or her observation of an event to a computer, or vice versa? These questions are addressed in this paper, where event models are learnt from training data and are used for event detection in novel videos. Event learning is formulated in a probabilistic framework, while event detection is treated as a graph-theoretic clustering problem.

Figure 1: Automated detection of sub-events for stealing video. Using the tracked trajectories, the sub-events of each agent are detected, and frames 37, 119, 127, and 138 of the video are shown.

The primary objective of this work is to detect and learn the complex interactions of multiple agents performing multiple actions in the form of domain events, without prior knowledge about the number of agents involved in the interaction or the length of the event. Another objective is to present a coherent representation of these domain events, as a means to encode the relationships between agents and objects participating in a domain event. Formally, a domain event is defined as a collection of actions performed by one or more agents. We term these actions video events, since they are directly measurable from the video (e.g. move, pick, enter). In this paper, events refer to domain events, and sub-events refer to video events, unless otherwise stated.
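To make the graph-theoretic clustering view of detection concrete, the following sketch evaluates the normalized cut criterion on a small symmetric weight matrix. It is a simplified illustration under stated assumptions: the matrix `W` is hypothetical, the helper names are invented for this example, and the minimum is found by exhaustive search over bipartitions rather than by the spectral relaxation normally used for normalized cuts.

```python
from itertools import combinations

def ncut_value(W, A):
    """Normalized cut of the bipartition (A, complement of A) for a symmetric W."""
    n = len(W)
    A = set(A)
    B = set(range(n)) - A
    cut = sum(W[i][j] for i in A for j in B)
    assoc_A = sum(W[i][j] for i in A for j in range(n))
    assoc_B = sum(W[i][j] for i in B for j in range(n))
    if assoc_A == 0 or assoc_B == 0:
        return float("inf")
    return cut / assoc_A + cut / assoc_B

def best_bipartition(W):
    """Exhaustively search all bipartitions; feasible only for tiny graphs."""
    n = len(W)
    best_value, best_cluster = float("inf"), None
    for k in range(1, n // 2 + 1):
        for A in combinations(range(n), k):
            value = ncut_value(W, A)
            if value < best_value:
                best_value, best_cluster = value, set(A)
    return best_value, best_cluster

# Hypothetical weights: sub-events 0-2 are strongly correlated with each other,
# sub-events 3-4 likewise, and the two groups are only weakly linked.
W = [
    [0.0, 0.9, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0],
    [0.9, 0.9, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.8, 0.0],
]
value, cluster = best_bipartition(W)
print(cluster, round(value, 3))
```

The search separates the weakly linked pair from the strongly correlated triple, which matches the stated assumption of high association within a cluster and low association between clusters.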
Although CASE (Hakeem, Sheikh, and Shah 2004) is an existing multiple agent event representation, the proposed method addresses three of its shortcomings. Firstly, we automatically learn the domain event structure from training videos and encode the domain event ontology. This has a significant advantage, since domain experts need not go through the tedious task of determining the structure of events by browsing all the videos in the domain. Secondly, we recognize the importance of representing the variations in the temporal order of the sub-events occurring in a domain event and encode them directly into our representation. These variations in the temporal order of sub-events occur due to the style in which different agents execute events. Finally, we present the concept of a video event graph (instead of an event-tree) for event detection in videos. The reason for departing from the temporal event-tree representation of the video is that it fails to detect events when the tree structure of the novel video contains interfering sub-events from an independent agent that were not present in the actual event tree structure. Also, it fails to represent the complete temporal order between sub-events, which can easily be represented by video event graphs.

For learning the domain events from training videos, firstly, we introduce the notion of a video event graph, which is a Directed Acyclic Graph (DAG) for representing the temporal relationships of sub-events in a video. In the video event graph, each vertex represents a sub-event and each directed edge provides the temporal relationship between two sub-events. These temporal relationships are based on the interval algebra in (Allen and Ferguson 1994), which is a more descriptive model of relationships compared to the low-level abstract relationship model of HMMs. Secondly, using the video event graph, we determine the event correlation...
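A video event graph of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' construction: the sub-events and frame intervals are hypothetical, every earlier sub-event is linked to every later one (the paper may use a sparser edge set), and the relation names follow the standard Allen interval relations.

```python
def allen_relation(a, b):
    """Return the Allen interval relation of interval a with respect to interval b."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "BEFORE"
    if e1 == s2:
        return "MEETS"
    if s1 == s2 and e1 == e2:
        return "EQUALS"
    if s1 == s2:
        return "STARTS" if e1 < e2 else "STARTED_BY"
    if e1 == e2:
        return "FINISHES" if s1 > s2 else "FINISHED_BY"
    if s2 < s1 and e1 < e2:
        return "DURING"
    if s1 < s2 and e2 < e1:
        return "CONTAINS"
    if s1 < s2:
        return "OVERLAPS"
    # Remaining cases are the inverses of BEFORE, MEETS, and OVERLAPS.
    if e2 < s1:
        return "AFTER"
    if e2 == s1:
        return "MET_BY"
    return "OVERLAPPED_BY"

def build_video_event_graph(sub_events):
    """Build (vertices, edges); edges only point forward in start-time order,
    so the resulting graph is acyclic by construction."""
    ordered = sorted(sub_events, key=lambda ev: (ev[2], ev[3]))
    edges = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            edges.append((a[:2], allen_relation(a[2:], b[2:]), b[:2]))
    return ordered, edges

# Hypothetical sub-events: (agent, label, start_frame, end_frame)
sub_events = [
    ("person1", "enters", 0, 10),
    ("person1", "moves", 10, 40),
    ("person2", "moves", 5, 50),
    ("person1", "raises", 20, 30),
]
vertices, edges = build_video_event_graph(sub_events)
for src, relation, dst in edges:
    print(src, relation, dst)
```

With edges oriented from the earlier to the later sub-event, a containment such as raising a hand while moving appears as CONTAINS on the forward edge; the same pair read in the opposite direction is DURING.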

Similar resources

Learning, Detection, Representation, Indexing and Retrieval of Multi-Agent Events in Videos

The world that we live in is a complex network of agents and their interactions, which are termed events. An instance of an event is composed of directly measurable low-level actions (which I term sub-events) having a temporal order. Also, the agents can act independently (e.g. voting) as well as collectively (e.g. scoring a touch-down in a football game) to perform an event. With the dawn of...

Learning, detection and representation of multi-agent events in videos

In this paper, we model multi-agent events in terms of a temporally varying sequence of sub-events, and propose a novel approach for learning, detecting and representing events in videos. The proposed approach has three main steps. First, in order to learn the event structure from training videos, we automatically encode the sub-event dependency graph, which is the learnt event model that depic...

CASEE: A Hierarchical Event Representation for the Analysis of Videos

A representational gap exists between low-level measurements (segmentation, object classification, tracking) and high-level understanding of video sequences. In this paper, we propose a novel representation of events in videos to bridge this gap, based on the CASE representation of natural languages. The proposed representation has three significant contributions over existing frameworks. First...

Action Change Detection in Video Based on HOG

Background and Objectives: Action recognition, the process of labeling an unknown action in a query video, is a challenging problem due to event complexity, variations in imaging conditions, and intra- and inter-individual action variability. A number of solutions have been proposed to solve the action recognition problem. Many of these frameworks suppose that each video sequence includes only one ...
